Abstract

There is a growing realization that uncertain information is a first-class citizen in modern database 
management. As such, we need techniques to correctly and efficiently process uncertain data in database 
systems. In particular, data reduction techniques that can produce concise, accurate synopses of large 
probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact 
probabilistic data synopses can form the foundation for human understanding and interactive data
exploration, probabilistic query planning and optimization, and fast approximate query processing in
probabilistic database systems.

In this paper, we introduce definitions and algorithms for building histogram- and wavelet-based 
synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries 
or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of 
probabilistic tuples under a given error metric. For a variety of different error metrics, we devise
efficient algorithms that construct optimal or near-optimal B-term histogram and wavelet synopses. This
requires careful analysis of the structure of the probability distributions, and novel extensions of known 
dynamic-programming-based techniques for the deterministic domain. Our experiments show that this 
approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data 
distribution, while taking equal or less time. 

1 Introduction 

Modern real-world applications generate massive amounts of data that is often uncertain and imprecise. For 
instance, data integration and record linkage tools can produce distinct degrees of confidence for output data 
tuples (based on the quality of the match for the underlying entities) [7]; similarly, pervasive multi-sensor
computing applications need to routinely handle noisy sensor/RFID readings [22]. Motivated by these
new application requirements, recent research efforts on probabilistic data management aim to incorporate 
uncertainty and probabilistic information as "first-class citizens" of the database system. 

Among different approaches for modeling uncertain data, tuple- and attribute-level uncertainty models
have seen wide adoption both in research papers and in early system prototypes [1, 2, 3]. In such
models, the attribute values for a data tuple are specified using a probability distribution over different 
mutually-exclusive alternatives (that might also include non-existence, i.e., the tuple is not present in the 
data), and assuming independence across tuples. The popularity of tuple-/attribute-level uncertainty models
is due to both their simplicity of representation in current relational systems, as well as their intuitive query 
semantics. In essence, a probabilistic database is a concise representation for a probability distribution over 
an exponentially-large collection of possible worlds, each representing a possible "grounded" (deterministic)
instance of the database (e.g., by flipping appropriately-biased independent coins to select an instantiation 
for each uncertain tuple). This "possible-worlds" semantics also implies clean semantics for queries over 
a probabilistic database — essentially, the result of a probabilistic query defines a probability distribution 
over the space of possible query results across all possible worlds [7]. The goal of query answering over
uncertain data is to provide the expected value of the answer, or tail bounds on the distribution of answers. 

Unfortunately, despite its intuitive semantics, the paradigm shift towards tuple-level uncertainty also
implies computationally-intractable #P-hard data complexity even for simple query processing operations [7].
These negative complexity results for query evaluation raise some serious practicality concerns for the
applicability of probabilistic database systems in realistic settings. One possible avenue of attack is through
the use of approximate query processing techniques over probabilistic data. As in conventional database
systems, such techniques need to rely on effective data reduction methods that can compress
large amounts of data down to concise data synopses while retaining the key statistical traits of the original
data collection [9]. It is then feasible to run more expensive algorithms over the much compressed
representation, and still obtain a fast and accurate answer. In addition to enabling fast approximate query answers
over probabilistic data, such compact synopses can also provide the foundation for human understanding 
and interactive data exploration, and probabilistic query planning and optimization. 

The data-reduction problem for deterministic databases is well understood, and several different
synopsis construction tools exist. Histograms [17] and wavelets [4] are two of the most popular general-purpose
data-reduction methods for conventional data distributions. It is therefore meaningful and interesting to
build histogram and wavelet synopses of probabilistic data. Here, histograms, as on traditional
"deterministic" data, divide the given input data into "buckets" so that all tuples falling in the same bucket have
similar behavior: the bucket boundaries are chosen to minimize a given error function that measures this 
within-bucket dissimilarity. Likewise, wavelets represent the probabilistic data by choosing a small number 
of wavelet basis functions which best describe the data, and contain as much of the "expected energy" of 
the data as possible. So for both histograms and wavelets, the synopses aim to capture and describe the 
probabilistic data as accurately as possible given a fixed size of synopsis. This description is clearly of use 
both to present to users to compactly show the key components of the data, as well as in approximate query 
answering and query planning. As in the large body of work on synopses for deterministic data, we consider 
a number of standard error functions, and show how to find the optimal synopsis relative to this class of 
function. 

There has been much recent work that has explored many different variants of the basic synopsis
problem, and we summarize the main contributions in Section 2. Unfortunately, this work is not applicable when
moving to probabilistic data collections that essentially represent a huge set of possible data distributions
(i.e., "possible worlds"). It may seem that probabilistic data can be thought of as defining a weighted instance
of the deterministic problem, but we show empirically and analytically that this yields poor
representations. Thus, building effective histogram or wavelet synopses for such collections of possible-world data
distributions is a challenging problem that mandates novel analysis and algorithmic approaches. 

Our Contributions. In this paper, we provide the first formal definitions and algorithmic solutions for 
constructing histogram and wavelet synopses over probabilistic data. In particular: 

• We define the "probabilistic data reduction problem" over a variety of cumulative and maximum 
error objectives, based on natural generalizations of histograms and wavelets from deterministic to 
probabilistic data. 

• For histograms, we give efficient techniques to find the optimal histogram under all common
cumulative error objectives (such as sum-squared error and sum-relative error), and corresponding maximum
error objectives. We also show that fast approximate solutions are possible. Our techniques rely on 
careful analysis of each objective function in turn, and showing that the cost of a given bucket, along 
with its optimal representative value, can be found efficiently from appropriate precomputed arrays. 
These results require careful proof, due to the distribution of values each item can take on, and the 
potential correlations between items. 

• For wavelets, we similarly show optimal techniques for the core sum-squared error (SSE) objective. 
Here, it suffices to compute the wavelet transformation of a deterministic input derived from the
probabilistic input. We also show how to extend algorithms from the deterministic setting to probabilistic
data for non-SSE objectives. 

• We report on experimental evaluation of our methods, and show that they achieve appreciably better 
results than simple heuristics. The space and time costs are always equal or better than the heuristics. 

After surveying prior work on uncertain data, we describe the relevant data models and synopsis
objectives in Section 2. Our results on histograms and wavelets are in Sections 3 and 4, respectively. We show
experimental analysis of our techniques in Section 5, and then suggest some future directions.

1.1 Prior Work on Uncertain Data 

Key ideas in probabilistic databases are presented in tutorials by Dalvi and Suciu [24, 8], and built on by
systems such as Trio [2] and MayBMS [1]. Initial research has focused on how to store and process uncertain
data within database systems, and hence how to answer SQL-style queries. Subsequently, there has been a 
growing realization that in addition to storing and processing uncertain data, there is a need to run advanced 
algorithms to analyze and mine uncertain data. Recent work has analyzed how to compute properties of 
streams of uncertain tuples such as the expected average and number of distinct items [20]; how to
cluster uncertain data [6]; and how to find frequent items within uncertain data [26].

To the best of our knowledge, no prior work has studied problems of building histogram and wavelet 
synopses of probabilistic data. There has been work on finding quantiles of unidimensional data [21],
which can be thought of as the equi-depth histogram; the techniques to find these show that it simplifies to 
the problem of finding quantiles over weighted data, where the weight of each item is simply its expected 
frequency. Similarly, finding frequent items is somewhat related to finding high-biased histograms. Lastly, 
we can also think of building a histogram as being a kind of clustering of the data along the domain.
However, the error objectives induced by formalizations of clustering are quite different from the histogram
objectives considered here, and so probabilistic clustering techniques [6] do not give good solutions for
histograms. 

2 Preliminaries and Problem Formulation 
2.1 Probabilistic Data Models 

A variety of models of probabilistic data have been proposed. The different models capture various levels 
of independence between the individual data values described. Each model describes a distribution over 
possible worlds: each possible world is a (traditional) relation containing some number of tuples. The most 
general model describes the complete correlations between all tuples; effectively, it describes every possible 
world and its associated probability explicitly. However, the size of such a model for even a moderate 
number of tuples is immense, since the exponentially many possible combinations of values are spelled out.
In practice, finding the (exponentially many) parameters for the fully general model is unfeasible; instead, 
more compact models are adopted which can reduce the number of parameters by making independence 
assumptions between tuples. 

The simplest probabilistic model is the basic model, which consists of a sequence of tuples containing a 
single value and the probability that it exists in the data. More formally, 

Definition 1. The basic model consists of a set of m tuples, where the jth tuple consists of a pair $(t_j, p_j)$.
Here, $t_j$ is an item drawn from a fixed domain, and $p_j$ is the probability that $t_j$ appears in any possible
world. Each possible world W is formed by the inclusion of a subset of the items $t_j$. We write $j \in W$ if
$t_j$ is present in the possible world W, and $j \notin W$ otherwise. Each tuple is assumed to be independent of all
others, so the probability of a possible world W is given by

$$\Pr[W] = \prod_{j \in W} p_j \prod_{j \notin W} (1 - p_j).$$
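
As a minimal sketch of Definition 1, the following Python snippet enumerates the possible worlds of a small basic-model input by brute force and merges worlds that contain the same items. The function name `possible_worlds` and the example probabilities are assumptions made for illustration only; enumeration is exponential in the number of tuples and is meant to make the definition concrete, not to be used at scale.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def possible_worlds(tuples):
    """Enumerate possible worlds of a basic-model input.

    tuples: list of (item, probability) pairs, assumed independent.
    Returns a dict mapping each world (a sorted tuple of items, i.e. a
    multiset) to its probability. Worlds containing the same items are
    merged, whichever tuples produced them.
    """
    worlds = defaultdict(Fraction)
    for included in product([False, True], repeat=len(tuples)):
        prob = Fraction(1)
        items = []
        for (item, p), inc in zip(tuples, included):
            # Multiply p_j for included tuples and (1 - p_j) for excluded ones.
            prob *= p if inc else 1 - p
            if inc:
                items.append(item)
        worlds[tuple(sorted(items))] += prob
    return dict(worlds)


# Hypothetical input: the probabilities are chosen only for illustration.
example = [(1, Fraction(1, 2)), (2, Fraction(1, 3)),
           (2, Fraction(1, 4)), (3, Fraction(1, 2))]
worlds = possible_worlds(example)
```

Using exact fractions keeps the world probabilities free of rounding error, so they can be checked to sum to one.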

Note here that the items $t_j$ can be somewhat complex (e.g., a row in a table), but without loss of generality
we will treat them as simple objects. In particular, we will be interested in cases where
the $t_j$s are drawn from a fixed, ordered domain (such as 1, 2, 3, ...) of size n, and several $t_j$s can
correspond to occurrences of the same item. We consider two extensions of the basic model, which each
capture dependencies not expressible in the basic model by providing a compact discrete probability density 
function (pdf) in place of the single probability. 

Definition 2. In the tuple pdf model, instead of a single (item, probability) pair, there is a set of pairs with
probabilities summing to at most 1. That is, the input consists of a sequence of tuples $t_j \in T$ of the form
$\langle (t_{j1}, p_{j1}), \ldots, (t_{j\ell}, p_{j\ell}) \rangle$. The interpretation is that each tuple specifies a set of mutually exclusive possible
values for the jth row of a relation. We require that the sum of the probabilities within a tuple is at most 1;
if it is less than 1, then the remaining probability measures the chance that there is no corresponding item.
We can interpret this as describing a discrete pdf for the jth item in the input as $\Pr[t_j = t_{j1}] = p_{j1}$, and so
on. Each tuple is assumed to be independent of all others, so that the probability of any possible world can
be computed by multiplying the relevant probabilities.

This model has been widely adopted, and is used in the TRIO [2] work and elsewhere [21]. It captures
the case where an observer makes readings, and has some uncertainty over what was seen. An alternate case 
is when an observer makes readings of a known item (for example, this could be a sensor making discrete 
readings), but has uncertainty over a value or frequency associated with the item: 

Definition 3. The value pdf model consists of a sequence of tuples of the form $\langle i : (f_{i1}, p_{i1}), \ldots, (f_{i\ell}, p_{i\ell}) \rangle$,
where the probabilities in each tuple sum to at most 1. The interpretation is that tuple i specifies the
distribution of frequencies of a separate item; the distributions of different items are assumed to be independent. We
can interpret this as describing a discrete pdf for the random variable $g_i$ giving (say) the distribution of the
frequencies of the ith item: $\Pr[g_i = f_{i1}] = p_{i1}$, and so on. Due to the independence, the probability of
any possible world is computed via multiplication of probabilities for the frequency of each item in turn. If
probabilities in a tuple sum to less than one, the remainder is taken to implicitly specify the probability that
the frequency is zero, for easy compatibility with the basic model. Let the set of all values of frequencies
used be V, so every $(f, p)$ pair has $f \in V$.

For both the basic and tuple pdf models, the frequency of any given item i within a possible world, $g_i$, is
a non-negative integer, and each occurrence corresponds to a tuple from the input. The value pdf model
can specify arbitrary fractional frequencies, but the number of such frequencies is bounded by the size of
the input, m. Note that the basic model is a special case of both the tuple pdf and value pdf models, but that
neither of these two is contained within the other. However, input in the tuple pdf model induces a distribution over
frequencies of each item, so we can define the induced value pdf which, for each item i, provides $\Pr[g_i = v]$
for $v \in V$. The important detail is that, unlike in the value pdf model, these induced pdfs are not
independent; nevertheless, this representation is useful in our subsequent analysis. For data presented in the
tuple pdf format, it is straightforward to build the induced value pdf for each item inductively, taking time
$O(|V|)$ to update the value pdf built so far. The space required is linear in the size of the input, $O(m)$.
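
The inductive construction of the induced value pdf can be sketched as follows. This is one possible reading of the paragraph above, not necessarily the authors' exact procedure: since each tuple contributes at most one occurrence of an item and tuples are independent, the frequency of an item is a sum of independent 0/1 variables, and each tuple updates the pdf built so far in time linear in its length. The name `induced_value_pdf` and the input layout (a list of lists of (item, probability) alternatives) are assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def induced_value_pdf(tuple_pdf_input):
    """Build, for each item, the distribution of its frequency g_i.

    tuple_pdf_input: list of tuples, each a list of (item, probability)
    alternatives whose probabilities sum to at most 1.
    Returns {item: [Pr[g_i = 0], Pr[g_i = 1], ...]}.
    """
    pdfs = defaultdict(lambda: [1])        # before any tuple, g_i = 0 surely
    for alternatives in tuple_pdf_input:
        # Total probability that this tuple instantiates to each item.
        mass = defaultdict(int)
        for item, p in alternatives:
            mass[item] += p
        for item, p in mass.items():
            old = pdfs[item]
            new = [0] * (len(old) + 1)
            for v, q in enumerate(old):
                new[v] += q * (1 - p)      # tuple does not yield the item
                new[v + 1] += q * p        # tuple yields one more occurrence
            pdfs[item] = new
    return dict(pdfs)


# Hypothetical tuple pdf input; the probabilities are illustrative only.
pdfs = induced_value_pdf([
    [(1, Fraction(1, 2)), (2, Fraction(1, 3))],
    [(2, Fraction(1, 4)), (3, Fraction(1, 2))],
])
```

Note that the result gives only the marginal distribution of each item's frequency; as stated above, these induced pdfs are not independent across items.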

Definition 4. Given an input in any of these models, let $\mathcal{W}$ denote the space of all possible worlds, and
$\Pr[W]$ denote the probability associated with possible world $W \in \mathcal{W}$. We can then compute the expectation
of various quantities over possible worlds as, given a function f which can be evaluated on a possible world
W,

$$E_W[f] = \sum_{W \in \mathcal{W}} \Pr[W] \, f(W). \tag{1}$$

Example 1. Consider the ordered domain containing the three items 1, 2, 3. The input (1, g), (2, |), (2, \), (3, \) 
in the basic model defines the following twelve possible worlds and corresponding probabilities: 
The input ((1, g), (2, |)), ((2, (3, \)) in the tuple pdf model defines the following eight possible
worlds and corresponding probabilities: 
The input (1 : (1, ^)), (2 : (1, |), (2, j)), (3 : (1, |)) in the value pdf model defines the pdfs

Pr[5i = 0] = i,Prbi = l] = i 

Pr[52 = 0] = ^, Pr[<72 = 1] = ^ Pr[52 = 'A = \ 

Prb3 = 0] = i,Prb3 = l] = | 

and hence provides the following distribution over twelve possible worlds: 
In all three cases, £-y\![gi\ = \. In the value pdf case, Evy[(72] = \, for the other two cases ^v\![g2\ = 

I 

Although two possible worlds may be formed in different ways, they may be indistinguishable: for 
instance, in the basic model example, the world W = {2} can result either from the second or third tuple. 
We will typically not distinguish possible worlds based on how they arose, and so treat them as identical. 

Our input is then characterized by parameters n, giving the size of the ordered domain from which the
input is drawn; m, the total number of pairs in the input (hence the input can be described with $O(m)$ pieces
of information); and V, the set of values that the frequencies take on. Here $|V| \le m$, but it could be much less.
In all three examples above, we have n = 3, m = 4, and V = {0, 1, 2}.
2.2 Histogram and Wavelet Synopses

We review key techniques for synopses on deterministic data. 

Histograms on Deterministic Data. Consider a one-dimensional data distribution defined (without loss of
generality) over the integer domain $[n] = \{0, \ldots, n - 1\}$. For each $i \in [n]$, we let $g_i$ denote the frequency of
domain value i in the underlying data set. A histogram synopsis provides a concise, piece-wise approximate
representation of the distribution based on partitioning the ordered domain [n] into B buckets. Each bucket
$b_k$ consists of a start and end point, $b_k = (s_k, e_k)$, and approximates the frequencies of the contiguous
subsequence of values $\{s_k, s_k + 1, \ldots, e_k\}$ (termed the span of the bucket) using a single representative
value $\hat{b}_k$. We also let $n_k = e_k - s_k + 1$ denote the width (i.e., number of distinct items) of bucket $b_k$. The
B buckets in a histogram form a partition of [n]; that is, $s_1 = 0$, $e_B = n - 1$, and $s_{k+1} = e_k + 1$ for all
$k = 1, \ldots, B - 1$.

By using $O(B) \ll n$ space to represent an $O(n)$-size data distribution, histograms provide a very
effective means of data reduction, with numerous applications [9]. This data reduction also implies
approximation errors in the estimation of frequencies, since each $g_i$ with $i \in b_k$ is estimated as $\hat{g}_i = \hat{b}_k$. The histogram
construction problem is, given a storage budget B, to build a B-bucket histogram $H_B$ that is optimal
under some aggregate error metric. Important histogram error metrics to minimize include, for instance, the
Sum-Squared-Error

$$SSE(H) = \sum_{i \in [n]} (g_i - \hat{g}_i)^2 = \sum_{k=1}^{B} \sum_{i=s_k}^{e_k} (g_i - \hat{b}_k)^2$$

(which defines the important class of V-optimal histograms [18, 19]) and the Sum-Squared-Relative-Error

$$SSRE(H) = \sum_{i \in [n]} \frac{(g_i - \hat{g}_i)^2}{\max\{c, |g_i|\}^2}$$

(where the sanity-bound constant c in the denominator is used to avoid excessive emphasis being placed
on small frequency values [10, 16]). The Sum-Absolute-Error (SAE) and Sum-Absolute-Relative-Error
(SARE) are defined similarly to SSE and SSRE, replacing the square with an absolute value so that

$$SAE(H) = \sum_{i \in [n]} |g_i - \hat{g}_i| \quad \text{and} \quad SARE(H) = \sum_{i \in [n]} \frac{|g_i - \hat{g}_i|}{\max\{c, |g_i|\}}.$$

In addition to such cumulative metrics, maximum error metrics have also been employed in order to
provide approximation guarantees on the relative/absolute error of individual frequency approximations [16];
these include, for example, the Maximum-Absolute-Relative-Error $MARE(H) = \max_{i \in [n]} \frac{|g_i - \hat{g}_i|}{\max\{c, |g_i|\}}$.

Histogram construction satisfies the principle of optimality: if the Bth bucket in the optimal histogram
spans the range $[i + 1, n - 1]$, then the remaining B − 1 buckets must form an optimal histogram for the
range $[0, i]$. This immediately leads to a Dynamic-Programming (DP) algorithm for computing the optimal
error value OPTH[j, b] for a b-bucket histogram spanning the prefix $[0, j]$ based on the following recurrence:

$$OPTH[j, b] = \min_{0 \le l < j} h\Big( OPTH[l, b - 1], \; \min_{\hat{b}} BErr([l + 1, j], \hat{b}) \Big), \tag{2}$$

where $BErr([x, y], z)$ denotes the error contribution of a single histogram bucket spanning $[x, y]$ using a
representative value of z to approximate all enclosed frequencies, and $h(x, y)$ is simply $x + y$ (respectively,
$\max\{x, y\}$) for cumulative (resp., maximum) error objectives. The key to translating the above recurrence
into a fast histogram construction algorithm lies in being able to quickly find the best representative $\hat{b}$ and
the corresponding optimal error value BErr() for the single-bucket case.
For example, in the case of the SSE objective, the representative value minimizing a bucket's SSE
contribution is exactly the average bucket frequency $\hat{b}_k = \frac{1}{n_k} \sum_{i=s_k}^{e_k} g_i$, where $n_k = e_k - s_k + 1$
is the bucket width, giving an optimal bucket SSE contribution

$$\min_{\hat{b}_k} BErr([s_k, e_k], \hat{b}_k) = \sum_{i=s_k}^{e_k} g_i^2 - \frac{1}{n_k} \Big( \sum_{i=s_k}^{e_k} g_i \Big)^2.$$

By precomputing the prefix sums $\sum_{i=0}^{j} g_i$ and $\sum_{i=0}^{j} g_i^2$ for each $j = 0, \ldots, n - 1$,
the SSE contribution for any bucket in the DP recurrence can be computed in $O(1)$ time, giving rise
to an $O(n^2 B)$ algorithm for building the SSE-optimal (or, V-optimal) B-bucket histogram [19]. Similar
ideas apply for the other error metrics discussed above. For instance, the optimal MARE contribution for
a single bucket depends only on the maximum/minimum frequencies. Using appropriate precomputed data
structures on dyadic ranges leads to an efficient, $O(n \log n \cdot B)$ DP algorithm for building MARE-optimal
histograms [16].
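
The dynamic program for the SSE objective on deterministic data can be sketched in Python as follows. This is an illustrative implementation of the standard prefix-sum approach described above, with hypothetical names (`v_optimal_histogram`, `bucket_sse`); it assumes B is at most the number of items and uses prefixes indexed by length rather than by end point.

```python
def v_optimal_histogram(g, B):
    """Return (optimal SSE, bucket list) for a B-bucket histogram of g."""
    n = len(g)
    # Prefix sums: P[j] = g[0] + ... + g[j-1]; Q[j] likewise for squares.
    P = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, x in enumerate(g):
        P[i + 1] = P[i] + x
        Q[i + 1] = Q[i] + x * x

    def bucket_sse(s, e):
        # SSE of bucket [s, e] when represented by its mean, in O(1) time.
        total = P[e + 1] - P[s]
        return (Q[e + 1] - Q[s]) - total * total / (e - s + 1)

    INF = float("inf")
    # opt[b][j]: least SSE for covering the first j items with b buckets.
    opt = [[INF] * (n + 1) for _ in range(B + 1)]
    back = [[0] * (n + 1) for _ in range(B + 1)]
    opt[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            for l in range(b - 1, j):
                cost = opt[b - 1][l] + bucket_sse(l, j - 1)
                if cost < opt[b][j]:
                    opt[b][j], back[b][j] = cost, l
    # Recover bucket boundaries by walking the back-pointers.
    buckets, j = [], n
    for b in range(B, 0, -1):
        l = back[b][j]
        buckets.append((l, j - 1))
        j = l
    return opt[B][n], buckets[::-1]


cost, buckets = v_optimal_histogram([2, 2, 0, 2, 3, 5, 4, 4], 2)
```

The three nested loops give the $O(n^2 B)$ running time; replacing `bucket_sse` and the `+` in the recurrence is the only change needed for other cumulative objectives whose bucket cost can be evaluated quickly.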

Wavelets on Deterministic Data. Haar wavelet synopses [4, 11, 12, 25] provide another data reduction tool
based on the Haar Discrete Wavelet Decomposition (DWT) [23] for hierarchically decomposing functions.
At a high level, the Haar DWT of a data distribution over [n] consists of a coarse overall approximation (the
average of all frequencies) together with n − 1 detail coefficients (constructed through recursive pairwise
averaging and differencing) that influence the reconstruction of frequency values at different scales. The
Haar DWT process can be visualized through a binary coefficient tree structure, like the one depicted in
Figure 1 for an example frequency array A = [2, 2, 0, 2, 3, 5, 4, 4]: leaf nodes $g_i$ correspond to the original
data distribution values in A. The root node $c_0$ is the overall average frequency, whereas each internal node
$c_i$ ($i = 1, \ldots, 7$) is a detail coefficient computed as half the difference between the average of frequencies
in $c_i$'s left child subtree and the average of frequencies in $c_i$'s right child subtree (e.g., $c_3 = \frac{1}{2}\left(\frac{3+5}{2} - \frac{4+4}{2}\right) = 0$).
Coefficients at level $l$ are normalized by a factor of $\sqrt{n/2^l}$; this has the effect of making the transform
into an orthonormal basis [23], so that the sum of squares of coefficients equals the sum of squares of the
original data values.
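
The pairwise averaging and differencing can be sketched as below for the example array A. The function names are hypothetical, the input length is assumed to be a power of two, and the normalization factor $\sqrt{n/2^l}$ (with $c_0$ and $c_1$ both treated as level 0) is an assumption chosen so that the sum of squares is preserved.

```python
import math


def haar_coefficients(a):
    """Unnormalized Haar DWT of a (length must be a power of two).

    Returns [c_0, c_1, ..., c_{n-1}] in error-tree order: c_0 is the overall
    average and c_i has children c_{2i} and c_{2i+1}.
    """
    coeffs = [0.0] * len(a)
    averages = [float(x) for x in a]
    while len(averages) > 1:
        half = len(averages) // 2
        pairs = [(averages[2 * k], averages[2 * k + 1]) for k in range(half)]
        # Detail coefficients at this resolution occupy indices half..2*half-1.
        for k, (left, right) in enumerate(pairs):
            coeffs[half + k] = (left - right) / 2
        averages = [(left + right) / 2 for left, right in pairs]
    coeffs[0] = averages[0]
    return coeffs


def normalize(coeffs):
    """Scale each level-l coefficient by sqrt(n / 2^l)."""
    n = len(coeffs)
    out = []
    for i, c in enumerate(coeffs):
        level = 0 if i == 0 else i.bit_length() - 1
        out.append(c * math.sqrt(n / 2 ** level))
    return out


A = [2, 2, 0, 2, 3, 5, 4, 4]
c = haar_coefficients(A)
energy = sum(x * x for x in normalize(c))
```

For this array the detail coefficient $c_3$ comes out as 0, in line with the worked value in the text, and `energy` matches the sum of squares of A.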

Any data value $g_i$ can be reconstructed as a function of the coefficients which are proper ancestors of the
corresponding node in the coefficient tree: the reconstructed value can be found by summing appropriately
scaled multiples of these $\log n + 1$ coefficients alone. The support of a coefficient $c_i$ is defined as the
interval of data values that $c_i$ is used to reconstruct; it is a dyadic interval of size $2^{\log n - l}$ for a coefficient at
resolution level $l$ (see Figure 1).

Given a limited amount of space for maintaining a wavelet synopsis W, a thresholding procedure retains
a certain number $B \ll n$ of the Haar coefficients as a highly-compressed approximate representation of the
original data (the remaining coefficients are implicitly set to 0). Similar to histogram construction, the
aim is to determine the "best" subset of B coefficients to retain, so that some overall error measure in the
approximation is minimized. By the orthonormality of the normalized Haar basis, greedily picking the B
largest coefficients (based on absolute normalized value) is provably optimal for the SSE error metric [23].
Recent work proposes schemes for optimal and approximate thresholding under different error metrics. The
key idea behind these schemes is to formulate a dynamic program over the coefficient-tree structure that
tabulates the optimal solution for a subtree rooted at node $c_j$ given the contribution from the choices made at
the proper-ancestor nodes of $c_j$ in the coefficient tree. This can handle a broad, natural class of distributive
error metrics (including, for instance, all the error measures discussed above, as well as weighted $L_p$-norm
error for arbitrary p) [11, 12].
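
Greedy thresholding for the SSE metric can be sketched as follows. The code uses a compact orthonormal Haar transform written for this illustration (the names `haar`, `inverse_haar` and `threshold` are hypothetical, and the input length is assumed to be a power of two); by orthonormality, the SSE of the reconstruction equals the sum of squares of the dropped coefficients.

```python
import math

ROOT2 = math.sqrt(2)


def haar(a):
    """Orthonormal Haar transform of a, in error-tree order."""
    out = [float(x) for x in a]
    length = len(out)
    while length > 1:
        half = length // 2
        sums = [(out[2 * k] + out[2 * k + 1]) / ROOT2 for k in range(half)]
        diffs = [(out[2 * k] - out[2 * k + 1]) / ROOT2 for k in range(half)]
        out[:length] = sums + diffs
        length = half
    return out


def inverse_haar(c):
    """Invert haar(): rebuild the data values from the coefficients."""
    out = list(c)
    length = 1
    while length < len(out):
        merged = []
        for x, y in zip(out[:length], out[length:2 * length]):
            merged += [(x + y) / ROOT2, (x - y) / ROOT2]
        out[:2 * length] = merged
        length *= 2
    return out


def threshold(coeffs, B):
    """Keep the B coefficients of largest absolute value; zero the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                   reverse=True)
    kept = set(order[:B])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


A = [2, 2, 0, 2, 3, 5, 4, 4]
approx = inverse_haar(threshold(haar(A), 3))
sse = sum((x - y) ** 2 for x, y in zip(A, approx))
```

This greedy rule is only claimed to be optimal for SSE; the other error metrics mentioned above need the tree-based dynamic programs.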

Figure 1: Error-tree structure for data distribution array A = [2, 2, 0, 2, 3, 5, 4, 4].

There are two distinct versions of the thresholding problem for non-SSE error metrics. In the restricted
version, the thresholding algorithm is forced to select values for the synopsis from the standard Haar
coefficient values (computed as discussed above). This restriction can lead to sub-optimal synopses for non-SSE
error [12]. In the more general, unrestricted version of the problem, retained coefficient values are chosen
to optimize the target error metric [11]. Let OPTW[j, b, v] denote the optimal error contribution across all
frequencies $g_i$ in the support (i.e., subtree) of coefficient $c_j$ assuming a total space budget of b coefficients
retained in $c_j$'s subtree; and, a (partial) reconstructed value of v based on the choices made at proper
ancestors of $c_j$. Then, based on the Haar DWT reconstruction process, we can compute OPTW[j, b, v] as the
minimum of two alternative error values at $c_j$:

(1) Optimal error when retaining the best value for $c_j$, found by minimizing over all possible values $v_j$ for
$c_j$ and allotments of the remaining budget across the left and right child of $c_j$, i.e.,

$$OPTW_r[j, b, v] = \min_{v_j,\; 0 \le b' \le b - 1} h\big( OPTW[2j, b', v + v_j], \; OPTW[2j + 1, b - b' - 1, v - v_j] \big).$$

(2) Optimal error when not retaining $c_j$, computed similarly:

$$OPTW_{nr}[j, b, v] = \min_{0 \le b' \le b} h\big( OPTW[2j, b', v], \; OPTW[2j + 1, b - b', v] \big),$$

where $h()$ stands for summation (max{}) for cumulative (resp., maximum) error-metric objectives. In the
restricted problem, the minimization over $v_j$ is eliminated (since the value for $c_j$ is fixed), and the values
for the "incoming" contribution v can be computed by stepping through all possible subsets of ancestors
for $c_j$. Since the depth of the tree is $O(\log n)$, this implies an $O(n^2)$ thresholding algorithm [11]. In
the unrestricted case, Guha and Harb propose efficient approximation schemes that employ techniques for
bounding and approximating the range of possible v values [12].
2.3 Probabilistic Data Reduction Problem

The key difference when moving to the probabilistic data setting is that data-distribution frequencies $g_i$
(and, similarly, the corresponding Haar DWT coefficients) are now random variables that are instantiated
for each individual possible world. We use $g_i(W)$ ($c_i(W)$) to denote the (instantiated) frequency of the
ith item (resp., value of the ith Haar coefficient) in possible world W; we sometimes omit the explicit
dependence on W for the sake of conciseness. This also implies that the error of a given data synopsis is
now a random variable over the collection of possible worlds. Thus, our goal naturally becomes that of
constructing a data synopsis $S \in \{H, W\}$ (a histogram or a wavelet synopsis) that optimizes an expected measure of the
target error objective over possible worlds. More formally, let $err(g_i, \hat{g}_i)$ denote the error of approximating $g_i$ by $\hat{g}_i$
(e.g., squared error or absolute relative error for item i); then, our problem can be formulated as follows.

[Synopsis Construction for Probabilistic Data] Given a collection of probabilistic attribute values, a
synopsis space budget B, and a target (cumulative or maximum) error metric, determine a size-B synopsis that
minimizes either (1) the expected cumulative error over all possible worlds, i.e., $E_W[\sum_i err(g_i, \hat{g}_i)]$ (in
the case of a cumulative error objective); or, (2) the maximum value of the per-item expected error over all
possible worlds, i.e., $\max_i \{E_W[err(g_i, \hat{g}_i)]\}$ (for a maximum error objective).

A natural first attempt to solve the probabilistic data reduction problem is to look to prior work, and 
ask whether techniques based on sampling, or building a weighted deterministic data set, will suffice. More 
precisely, one could imagine sampling a possible world W with probability $\Pr[W]$ and building the optimal
synopsis for W; or, for each item i, finding $E_W[g_i]$ and building the synopsis of the "expected" data. Our
subsequent analysis shows that such attempts are insufficient. We give precise formulations of the optimal 
solution to the problem under a variety of error metrics, and one can verify by inspection that they do not 
in general correspond to any of these simple solutions. Further, we compare the optimal solution to these 
solutions in our experimental evaluation, and observe that the quality of the solution found is substantially 
poorer. Observe that this stands in contrast to prior work on estimating functions such as expected number
of distinct items [21], which analyzes the number of samples needed to give an accurate estimate. This
is because our synopses are not scalar values, and so it is not meaningful to sample many possible worlds, 
build the synopsis on each, and then find the "average" of these synopses. 

3 Histograms on Uncertain Data 

We primarily consider producing optimal histograms for probabilistic data under cumulative error
objectives. These minimize the expected cost of the histogram over all possible worlds. Our techniques are based
on applying the dynamic programming approach, and therefore most of our effort is in showing how to
compute the optimal representative $\hat{b}$ for a bucket b under a given error objective, and also to compute the
corresponding value of $E_W[BErr(b, \hat{b})]$. Here, we observe that the principle of optimality still holds even under uncertain
data: since the expectation of the sum of costs of each bucket is equal to the sum of the expectations,
removing the final bucket should leave an optimal (B − 1)-bucket histogram over the prefix of the domain. Hence,
we will be able to invoke equation (2), and find a solution which requires evaluating $O(Bn^2)$ possibilities.

3.1 Sum-squared error 

The sum-squared error measure SSE is the sum of the squared differences between the values within a
bucket $b_k$ and the representative value of the bucket, $\hat{b}_k$. For a fixed possible world W, the optimal value
for $\hat{b}_k$ is the mean of the frequencies $g_i$ in the bucket, and the measure reduces to a multiple of the sample
variance of the values within the bucket. This holds true even under probabilistic data, as we show below.
For deterministic data, it is straightforward to quickly compute the (sample) variance within a given bucket,
and therefore use dynamic programming to select the optimum bucketing [19]. In order to use the same
approach over uncertain data, which specifies exponentially many possible worlds, we need to be able
to efficiently compute the variance within a given bucket b specified by start point s and end point e.

We first define some necessary quantities. Given the (fixed) discrete frequency distribution implied by
$W$, and bucket $b$ of span $n_b$, the sample variance of $W$, $\sigma_b^2(W)$, is defined in terms of $g_i(W)$, the frequency
of item $i$ in world $W$, as

$$\sigma_b^2(W) = \sum_{i=s}^{e} \frac{g_i(W)^2}{n_b} - \Big(\sum_{i=s}^{e} \frac{g_i(W)}{n_b}\Big)^2 \qquad (3)$$

Given a distribution of possible worlds where $\Pr[W]$ is the probability of possible world $W$, the variance is

$$\mathrm{Var}_W(b) = E_W[\sigma_b^2] = \sum_{W \in \mathcal{W}} \sigma_b^2(W) \Pr[W] \qquad (4)$$

This follows by substituting the variance function into equation (1).
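Fact 1. Under the sum-squared error measure, the cost is minimized by setting $\hat{b} = \frac{1}{n_b} E_W[\sum_{i=s}^{e} g_i] = \bar{b}$.
Proof. The cost of the bucket is

$$\mathrm{SSE}(b, \hat{b}) = E_W\Big[\sum_{i=s}^{e} (g_i - \hat{b})^2\Big] = E_W\Big[\sum_{i=s}^{e} (g_i - \bar{b} + \bar{b} - \hat{b})^2\Big]$$

$$= E_W\Big[\sum_{i=s}^{e} (g_i - \bar{b})^2\Big] + E_W\Big[\sum_{i=s}^{e} 2(\bar{b} - \hat{b})(g_i - \bar{b}) + (\bar{b} - \hat{b})^2\Big]$$

$$= n_b \mathrm{Var}_W(b) + 2(\bar{b} - \hat{b})\Big(E_W\Big[\sum_{i=s}^{e} g_i\Big] - n_b \bar{b}\Big) + n_b(\bar{b} - \hat{b})^2$$

$$= n_b\big(\mathrm{Var}_W(b) + (\bar{b} - \hat{b})^2\big)$$

Since the second term is never negative, it is minimized by setting $\hat{b} = \bar{b}$, yielding the cost as simply
$n_b \mathrm{Var}_W(b)$. □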

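Using equations (3) and (4), we can write $\mathrm{SSE}(b, \bar{b}) = n_b \mathrm{Var}_W(b)$ as the combination of two terms:

$$\mathrm{SSE}(b, \bar{b}) = \sum_{W \in \mathcal{W}} \Pr[W] \sum_{i=s}^{e} g_i(W)^2 - \frac{1}{n_b} \sum_{W \in \mathcal{W}} \Pr[W] \Big(\sum_{i=s}^{e} g_i(W)\Big)^2$$

$$= \sum_{i=s}^{e} E_W[g_i^2] - \frac{1}{n_b} E_W\Big[\Big(\sum_{i=s}^{e} g_i\Big)^2\Big] \qquad (5)$$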

The first term is the expectation (over possible worlds) of the sum of squares of frequencies of each item
in the bucket. The second term can be interpreted as the expected square of the weight of the bucket, scaled
by the span of the bucket. We show how to compute each term efficiently in our different models.

Value pdf model. In the value pdf model, we have a distribution for each item $i$ over frequency values
$v_j \in V$ giving $\Pr[g_i = v_j]$. By independence of the values, we have immediately

$$\sum_{i=s}^{e} E_W[g_i^2] = \sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j^2$$

Since for any random variable $X$ we have $E[X^2] = \mathrm{Var}[X] + E[X]^2$, we can use linearity of expectation
and summation of variance to find the second term in equation (5) as

$$E_W\Big[\Big(\sum_{i=s}^{e} g_i\Big)^2\Big] = E_W\Big[\sum_{i=s}^{e} g_i\Big]^2 + \mathrm{Var}_W\Big[\sum_{i=s}^{e} g_i\Big] \qquad (6)$$

$$= \Big(\sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j\Big)^2 + \sum_{i=s}^{e} \mathrm{Var}_W[g_i]$$

where, in turn,

$$\mathrm{Var}_W[g_i] = \sum_{v_j \in V} \Pr[g_i = v_j] v_j^2 - \Big(\sum_{v_j \in V} \Pr[g_i = v_j] v_j\Big)^2$$

Tuple pdf model. In the tuple pdf case, things appear more involved, because of the interactions between
items in the same tuple. As shown by equation (5), we need to compute $E_W[g_i^2]$ and $E_W[(\sum_i g_i)^2]$. Let the
set of tuples in the input be $T = \{t_j\}$, so that each tuple has an associated pdf giving $\Pr[t_j = i]$, from which
can be derived $\Pr[a \le t_j \le b]$, the probability that the $j$th tuple in the input is an item between $a$ and $b$ in
the input domain. Then

$$E_W[g_i^2] = \mathrm{Var}_W[g_i] + (E_W[g_i])^2 = \sum_{t_j \in T} \Pr[t_j = i](1 - \Pr[t_j = i]) + \Big(\sum_{t_j \in T} \Pr[t_j = i]\Big)^2$$

Here, we rely on the fact that the variance of each $g_i$ is the sum of the variances arising from each tuple
in the input. Observe that although there are dependencies between particular items, these do not affect
the computation of the expectations for individual items, which can then be summed to find the overall
answer. For the second term in (5), we can use the expression (6), and we already have an expression
for $(E_W[\sum_{i=s}^{e} g_i])^2$. But we cannot simply write $\mathrm{Var}_W[\sum_{i=s}^{e} g_i]$ as the sum of variances, since these are no
longer independent variables. So instead, we treat all items in the same bucket together as a single item and
compute its expected square by iterating over all tuples in the input $T$, using

$$\mathrm{Var}_W\Big[\sum_{i=s}^{e} g_i\Big] = \sum_{t_j \in T} \Pr[s \le t_j \le e](1 - \Pr[s \le t_j \le e])$$

Example. Consider the input $\langle (1, \frac{1}{2}), (2, \frac{1}{3}) \rangle, \langle (2, \frac{1}{4}), (3, \frac{1}{2}) \rangle$ in the tuple pdf model from Example 1. We
compute

$$E_W\Big[\sum_{i=1}^{3} g_i^2\Big] = \frac{1}{4} + \frac{2}{9} + \frac{3}{16} + \frac{1}{4} + \Big(\frac{1}{2}\Big)^2 + \Big(\frac{7}{12}\Big)^2 + \Big(\frac{1}{2}\Big)^2 = \frac{252}{144}$$

Similarly, $E_W[(\sum_{i=1}^{3} g_i)^2] = \frac{5}{36} + \frac{3}{16} + (\frac{5}{6} + \frac{3}{4})^2 = \frac{408}{144} = \frac{17}{6}$. Combining these, we find the
cost $n_b \mathrm{Var}_W(b)$ of the bucket $1 \ldots 3$ as $\frac{252}{144} - \frac{1}{3} \cdot \frac{408}{144} = \frac{116}{144} = \frac{29}{36}$. The same value can be obtained by
computing the expected sample variance, scaled by $n_b = 3$, over all possible worlds shown in Example 1. □

Efficient Computation. The above development shows how to find the cost of a specified bucket. Computing
the minimum cost histogram requires comparing the cost of many different choices of buckets. As in the
deterministic case, since the cost is the sum of the costs of the buckets, the dynamic programming solution
can find the optimal cost. This computes the cost of the optimal $j$ bucket solution up to position $\ell$ combined
with the cost of the optimal $k - j$ bucket solution over positions $\ell + 1$ to $n$. This means finding the cost of
$O(n^2)$ buckets. By analyzing the form of the above expressions for the cost of a bucket, we can precompute
enough information to allow the cost of any specified bucket to be found in time $O(1)$.
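As a sanity check on the tuple pdf bucket cost, the closed-form expression built from equations (5) and (6) can be compared against brute-force enumeration of possible worlds. The sketch below is an illustration under assumptions, not the paper's code: the function names and the dict-based encoding of tuples are hypothetical, and each tuple is taken to be absent from a world with its leftover probability mass. Exact rational arithmetic is used so the two results can be compared for equality.

```python
from fractions import Fraction as F
from itertools import product

# Two-tuple input of the worked example: each tuple maps item -> probability.
TUPLES = [{1: F(1, 2), 2: F(1, 3)}, {2: F(1, 4), 3: F(1, 2)}]


def closed_form_cost(tuples, s, e):
    """Bucket cost sum_i E[g_i^2] - E[(sum_i g_i)^2] / n_b for bucket s..e."""
    n_b = e - s + 1
    first = F(0)
    for i in range(s, e + 1):
        probs = [t.get(i, F(0)) for t in tuples]
        # E[g_i^2] = Var[g_i] + E[g_i]^2, variances summed over tuples.
        first += sum((p * (1 - p) for p in probs), F(0)) + sum(probs, F(0)) ** 2
    # Treat the whole bucket as a single item: Pr[s <= t_j <= e] per tuple.
    inside = [sum((p for i, p in t.items() if s <= i <= e), F(0)) for t in tuples]
    second = sum((p * (1 - p) for p in inside), F(0)) + sum(inside, F(0)) ** 2
    return first - second / n_b


def enumerated_cost(tuples, s, e):
    """Same quantity, computed by enumerating every possible world."""
    n_b = e - s + 1
    options = []
    for t in tuples:
        outcomes = list(t.items())
        absent = 1 - sum(t.values(), F(0))
        if absent > 0:
            outcomes.append((None, absent))  # tuple does not appear
        options.append(outcomes)
    total = F(0)
    for world in product(*options):
        prob = F(1)
        g = {i: 0 for i in range(s, e + 1)}
        for item, p in world:
            prob *= p
            if item is not None and s <= item <= e:
                g[item] += 1
        squares = sum(v * v for v in g.values())
        weight = sum(g.values())
        total += prob * (F(squares) - F(weight * weight, n_b))
    return total
```

For the bucket $1 \ldots 3$ both functions return $\frac{29}{36}$, matching the example.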

For the tuple pdf model (the value pdf model is similar but simpler) we precompute arrays of length $n$
containing the following information:

$$A[e] = \sum_{i=1}^{e} \Big( \sum_{t_j \in T} \Pr[t_j = i](1 - \Pr[t_j = i]) + \Big(\sum_{t_j \in T} \Pr[t_j = i]\Big)^2 \Big)$$

$$B[e] = \sum_{t_j \in T} \Pr[t_j \le e] \qquad C[e] = \sum_{t_j \in T} (\Pr[t_j \le e])^2$$

for $1 \le e \le n$, and set $A[0] = B[0] = C[0] = 0$. Then the cost $\mathrm{SSE}((s, e), \bar{b})$ is (after some symbolic
manipulation) given by

$$A[e] - A[s-1] - \frac{(B[e] - B[s-1])(B[e] + B[s-1] + 1) - (C[e] - C[s-1])}{e - s + 1}$$

With the input in sorted order, these three arrays can be computed with a linear pass over the input. We
therefore conclude:

Theorem 1. Optimal sum-squared error histograms can be computed over probabilistic data presented in
the value pdf or tuple pdf models in time $O(m + Bn^2)$.

3.2 Sum-Squared-Relative-Error 

The sum of squares of relative errors measure, SSRE, over deterministic data computes the difference
between $\hat{b}$, the representative value for the bucket, and each value within the bucket, and reports the square of
this difference as a ratio to the square of the corresponding value. Typically, an additional 'sanity' parameter
$c$ is used to limit the value of this quantity in case some values in the bucket are very small. With probabilistic
data, we are interested in the expected value of this quantity over all possible worlds. So, given a bucket
$b$, the cost is

$$\mathrm{SSRE}(b, \hat{b}) = E_W\Big[\sum_{i=s}^{e} \frac{(g_i - \hat{b})^2}{\max(c^2, g_i^2)}\Big]$$

By linearity of expectation, the cost for a bucket given $\hat{b}$ can be computed by evaluating at all values of
$g_i$ which have a non-zero probability (i.e. at all $v \in V$).

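Value pdf model. We can write the sum of squared relative error cost in terms of the probability that, over
all possible worlds, the $i$th item has frequency $v_j \in V$. Then

$$\mathrm{SSRE}(b, \hat{b}) = \sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] \frac{(v_j - \hat{b})^2}{\max(c^2, v_j^2)} \qquad (7)$$

We can rewrite this using the function $w(x) = 1/\max(c^2, x^2)$, which is a fixed value once $x$ is specified.
Now our cost is

$$\mathrm{SSRE}(b, \hat{b}) = \sum_{i=s}^{e} \sum_{v_j \in V} \big( \Pr[g_i = v_j] w(v_j) v_j^2 - 2 \Pr[g_i = v_j] w(v_j) v_j \hat{b} + \Pr[g_i = v_j] w(v_j) \hat{b}^2 \big)$$

which is a quadratic in $\hat{b}$. Simple calculus demonstrates that the optimal value of $\hat{b}$ to minimize this cost is

$$\hat{b} = \frac{\sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j w(v_j)}{\sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] w(v_j)}$$

Substituting this value of $\hat{b}$ gives

$$\mathrm{SSRE}(b, \hat{b}) = \sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j^2 w(v_j) - \frac{\big(\sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j w(v_j)\big)^2}{\sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] w(v_j)}$$

In order to compute the cost efficiently, we can compute and store the following arrays:

$$X[e] = \sum_{i=1}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j^2 w(v_j)$$

$$Y[e] = \sum_{i=1}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] v_j w(v_j)$$

$$Z[e] = \sum_{i=1}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] w(v_j)$$

From these values, the cost for any given bucket specified by $s$ and $e$ can be computed in constant time as

$$\min_{\hat{b}} \mathrm{SSRE}((s, e), \hat{b}) = X[e] - X[s-1] - \frac{(Y[e] - Y[s-1])^2}{Z[e] - Z[s-1]}$$

The standard dynamic programming then finds the optimal set of buckets.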
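The prefix-sum construction for SSRE in the value pdf model can be sketched as below. The function names, the list-of-dicts encoding of the per-item pdfs, and the array names X, Y, Z are assumptions for this illustration. The direct evaluation of equation (7) is included only so that the constant-time formula can be checked against it.

```python
def w(x, c):
    """Weight function w(x) = 1 / max(c^2, x^2)."""
    return 1.0 / max(c * c, x * x)


def build_prefix_arrays(pdfs, c):
    """pdfs[i-1] maps value -> probability for item i (1-based items).

    Returns prefix arrays X, Y, Z with X[0] = Y[0] = Z[0] = 0.
    """
    X, Y, Z = [0.0], [0.0], [0.0]
    for pdf in pdfs:
        X.append(X[-1] + sum(p * w(v, c) * v * v for v, p in pdf.items()))
        Y.append(Y[-1] + sum(p * w(v, c) * v for v, p in pdf.items()))
        Z.append(Z[-1] + sum(p * w(v, c) for v, p in pdf.items()))
    return X, Y, Z


def ssre_bucket(X, Y, Z, s, e):
    """Optimal representative and cost of bucket s..e in O(1) time."""
    x = X[e] - X[s - 1]
    y = Y[e] - Y[s - 1]
    z = Z[e] - Z[s - 1]  # strictly positive whenever c > 0
    return y / z, x - y * y / z


def ssre_direct(pdfs, c, s, e, b_hat):
    """Direct evaluation of equation (7), used only for checking."""
    return sum(p * w(v, c) * (v - b_hat) ** 2
               for pdf in pdfs[s - 1:e] for v, p in pdf.items())


# Hypothetical four-item input and sanity constant.
PDFS = [{0: 0.5, 2: 0.5}, {1: 1.0}, {0: 0.2, 3: 0.8}, {4: 1.0}]
C = 0.5
X, Y, Z = build_prefix_arrays(PDFS, C)
b_opt, cost_opt = ssre_bucket(X, Y, Z, 2, 4)
```

Plugging `ssre_bucket` into a histogram dynamic program as the bucket cost oracle gives the $O(Bn^2)$ term of the running time, with the arrays built in one pass over the input.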

Tuple pdf model. Observe that for this cost measure, the cost for the bucket $b$ given by equation (7) is
the sum of costs obtained by each item in the bucket. We can focus solely on the contribution to the cost
made by a single item $i$, and observe that equation (7) depends only on the (induced) distribution giving
$\Pr[g_i = v_j]$: there is no dependency on any other item. As a consequence, we can simply compute the
induced value pdf (Section 2.1) for each item independently, and apply the above analysis.

Theorem 2. Optimal sum squared relative error histograms can be computed over probabilistic data
presented in the value pdf model in time $O(m + Bn^2)$ and $O(m|V| + Bn^2)$ in the tuple pdf model.

3.3 Sum-Absolute-Error 

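As before, $V$ is the set of possible values taken on by the $g_i$s, indexed so that $v_1 < v_2 < \cdots < v_{|V|}$. Given
some $\hat{b}$, let $j'$ satisfy $v_{j'} \le \hat{b} < v_{j'+1}$ (we can insert 'dummy' values $v_0$ and $v_{|V|+1} = \infty$ if $\hat{b}$ falls
outside of $v_1 \ldots v_{|V|}$). The sum of absolute errors is given by

$$\mathrm{SAE}(b, \hat{b}) = \sum_{i=s}^{e} \sum_{v_j \in V} \Pr[g_i = v_j] |\hat{b} - v_j|$$

$$= \sum_{i=s}^{e} \Big( (\hat{b} - v_{j'}) \Pr[g_i \le v_{j'}] + (v_{j'+1} - \hat{b}) \Pr[g_i \ge v_{j'+1}] + \sum_{v_j \in V} \begin{cases} \Pr[g_i \le v_j](v_{j+1} - v_j) & \text{if } v_j < v_{j'} \\ \Pr[g_i > v_j](v_{j+1} - v_j) & \text{if } v_j > v_{j'} \end{cases} \Big)$$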

The contribution of the first two terms can be written as

$$(v_{j'+1} - v_{j'}) \Pr[g_i \le v_{j'}] + (\hat{b} - v_{j'+1})(\Pr[g_i \le v_{j'}] - \Pr[g_i \ge v_{j'+1}])$$

This gives a quantity that is independent of $\hat{b}$ added to one that depends linearly on $\Pr[g_i \le v_{j'}] - \Pr[g_i \ge v_{j'+1}]$,
which we define as $\Delta_{j'}$. Since $\hat{b} - v_{j'+1} \le 0$, if $\Delta_{j'} > 0$ we can reduce the cost by making $\hat{b}$ closer to $v_{j'}$;
if $\Delta_{j'} < 0$, we can reduce the cost by making $\hat{b}$ closer to $v_{j'+1}$. Therefore, the optimal value of $\hat{b}$ occurs when
we make it equal to some $v_j$ value (since, when $\Delta_{j'} = 0$, we have the same result when setting $\hat{b}$ to either $v_{j'}$ or
$v_{j'+1}$, or anywhere in between). So we assume that $\hat{b} = v_{j'}$ for some $v_{j'} \in V$ and can state

$$\mathrm{SAE}(b, \hat{b}) = \sum_{i=s}^{e} \sum_{v_j \in V} \begin{cases} \Pr[g_i \le v_j](v_{j+1} - v_j) & \text{if } v_{j'} > v_j \\ \Pr[g_i > v_j](v_{j+1} - v_j) & \text{if } v_{j'} \le v_j \end{cases}$$

Define $P_{j,s,e} = \sum_{i=s}^{e} \Pr[g_i \le v_j]$ and $P^*_{j,s,e} = \sum_{i=s}^{e} \Pr[g_i > v_j]$. Observe that $P_{j,s,e}$ is monotone
increasing in $j$ while $P^*_{j,s,e}$ is monotone decreasing in $j$. So we have

$$\mathrm{SAE}(b, \hat{b}) = \sum_{v_j < \hat{b}} P_{j,s,e}(v_{j+1} - v_j) + \sum_{v_j \ge \hat{b}} P^*_{j,s,e}(v_{j+1} - v_j) \qquad (8)$$

Now observe that we have a contribution of $(v_{j+1} - v_j)$ for all $j$ values, multiplied by either $P_{j,s,e}$
or $P^*_{j,s,e}$. Consider the effect on SAE if we step $\hat{b}$ through values $v_1, v_2, \ldots, v_{|V|}$. We have

$$\mathrm{SAE}(b, v_{\ell+1}) - \mathrm{SAE}(b, v_\ell) = (P_{\ell,s,e} - P^*_{\ell,s,e})(v_{\ell+1} - v_\ell)$$

Because $P_{j,s,e}$ is monotone increasing in $j$, and $P^*_{j,s,e}$ is monotone decreasing in $j$, the quantity
$P_{\ell,s,e} - P^*_{\ell,s,e}$ is monotone increasing in $\ell$. Thus $\mathrm{SAE}(b, \hat{b})$ can have a single minimum value as $\hat{b}$ is varied, and
is increasing in both directions away from this value. This minimum does not depend on the $v_j$ values
themselves; instead, it occurs (approximately) when $P_{j,s,e} \approx P^*_{j,s,e} \approx n_b/2$, where $n_b = (e - s + 1)$ as
before. It therefore suffices to find the $v_j$ value defined by

$$v_{j'} = \arg\min_{v_\ell \in V} \Big( \sum_{v_j < v_\ell} P_{j,s,e} + \sum_{v_j \ge v_\ell} P^*_{j,s,e} \Big)$$

and then set $\hat{b} = v_{j'}$ to obtain the optimal SAE cost.

Tuple and Value pdf models. In the value pdf case, it is straightforward to compute the $P$ and $P^*$ values
directly from the input pdfs. For the tuple pdf case, observe that from the form of the expression for
SAE, there are no interactions between different $g_i$ values: although the input specifies interactions and
(anti)correlations between different variables, for computing the error in a bucket we can treat each item
independently in turn. We therefore convert to the induced value pdf (at an additional cost of $O(m|V|)$), and
use this in our subsequent computations.

Efficient Computation. In order to quickly find the cost of a given bucket $b$, we must find the optimal $\hat{b}$ and
the cost of the bucket using $\hat{b}$ as the representative. Our approach is to precompute $\sum_{v_j < v_\ell} P_{j,1,e}(v_{j+1} - v_j)$
and $\sum_{v_j \ge v_\ell} P^*_{j,1,e}(v_{j+1} - v_j)$ values for all values of $v_\ell \in V$ and $e \in [n]$. Now $\mathrm{SAE}(b, \hat{b})$ for any $\hat{b} \in V$ can be
computed (from (8)) as the sum of two differences of precomputed values. The minimum value attainable
by any $\hat{b}$ can then be found by a ternary search over the values $V$, using $O(\log |V|)$ probes. Finally, the cost
for the bucket using this $\hat{b}$ is also found from the same information. The cost is $O(|V|n)$ preprocessing to
build tables of prefix sums, and then $O(\log |V|)$ to find the optimal cost of a given bucket. Therefore,

Theorem 3. Optimal sum-absolute-error histograms can be computed over probabilistic data presented in
the (induced) value pdf model in time $O(n(|V| + Bn + n \log |V|))$.

Note that for all models of probabilistic data, $|V| \le m$ is polynomial in the size of the input, so this
is a fully polynomial time algorithm.
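The search for the optimal SAE representative can be illustrated with a short sketch. It is a simplification under stated assumptions rather than the procedure in the text: it stores prefix sums of the cumulative probabilities $\Pr[g_i \le v_j]$ only, and it replaces the ternary search by a binary search for the first index where $P_{j,s,e} \ge P^*_{j,s,e}$, which is valid because $P_{j,s,e} - P^*_{j,s,e} = 2P_{j,s,e} - n_b$ is monotone in $j$. All function names and the input encoding are hypothetical, and the cost is re-evaluated directly instead of from precomputed tables.

```python
def build_cdf_prefix(pdfs, V):
    """cum[j][e] = sum over items 1..e of Pr[g_i <= V[j]]; O(|V| n) time.

    pdfs[i-1] maps value -> probability for item i; V is sorted ascending.
    """
    cum = [[0.0] for _ in V]
    for pdf in pdfs:
        running = 0.0
        for j, v in enumerate(V):
            running += pdf.get(v, 0.0)
            cum[j].append(cum[j][-1] + running)
    return cum


def best_sae_representative(cum, V, s, e):
    """Smallest v_j with P_{j,s,e} >= P*_{j,s,e}, found with O(log |V|) probes."""
    n_b = e - s + 1
    lo, hi = 0, len(V) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        P = cum[mid][e] - cum[mid][s - 1]
        if 2 * P >= n_b:  # P >= P*, since P* = n_b - P
            hi = mid
        else:
            lo = mid + 1
    return V[lo]


def sae_direct(pdfs, s, e, b_hat):
    """Expected sum of absolute errors for bucket s..e, by definition."""
    return sum(p * abs(v - b_hat)
               for pdf in pdfs[s - 1:e] for v, p in pdf.items())


# Hypothetical four-item input.
PDFS = [{0: 0.5, 2: 0.5}, {1: 1.0}, {0: 0.2, 3: 0.8}, {4: 1.0}]
V = [0, 1, 2, 3, 4]
CUM = build_cdf_prefix(PDFS, V)
```

The chosen value behaves like a weighted median of the bucket, which matches the observation that the minimum occurs where $P_{j,s,e} \approx P^*_{j,s,e} \approx n_b/2$.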



3.4 Sum-Absolute-Relative-Error 



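For sum of absolute relative errors, the bucket cost $\mathrm{SARE}(b, \hat{b})$ is

$$E_W\Big[\sum_{i=s}^{e} \frac{|g_i - \hat{b}|}{\max(c, g_i)}\Big] = \sum_{i=s}^{e} \sum_{v_j \in V} \frac{\Pr[g_i = v_j]}{\max(c, v_j)} |v_j - \hat{b}| = \sum_{i=s}^{e} \sum_{v_j \in V} w_{i,j} |v_j - \hat{b}| \qquad (9)$$

where we define $w_{i,j} = \frac{\Pr[g_i = v_j]}{\max(c, v_j)}$. More generally, the $w_{i,j}$ can be arbitrary non-negative weights.
Setting $j'$ so that $v_{j'} \le \hat{b} \le v_{j'+1}$, we can write the cost as

$$\sum_{i=s}^{e} \sum_{v_j \in V} \begin{cases} w_{i,j}(\hat{b} - v_j) & \text{if } v_j \le \hat{b} \\ w_{i,j}\big(v_{j'+1} - \hat{b} + \sum_{\ell=j'+1}^{j-1} (v_{\ell+1} - v_\ell)\big) & \text{if } v_j > \hat{b} \end{cases}$$

We define $W_{i,j} = \sum_{r=1}^{j} w_{i,r}$ and $W^*_{i,j} = \sum_{r=j+1}^{|V|} w_{i,r}$. Then, rearranging the previous sum, the cost is

$$\mathrm{SARE}(b, \hat{b}) = \sum_{i=s}^{e} \Big( W_{i,j'}(\hat{b} - v_{j'}) - W^*_{i,j'}(\hat{b} - v_{j'+1}) + \sum_{v_j \in V} \begin{cases} W_{i,j}(v_{j+1} - v_j) & \text{for } v_{j'} > v_j \\ W^*_{i,j}(v_{j+1} - v_j) & \text{for } v_{j'} < v_j \end{cases} \Big)$$

The same style of argument as in the previous section suffices to show that the optimal choice of $\hat{b}$ is
when $\hat{b} = v_{j'}$ for some $j'$. We define $P_{j,s,e} = \sum_{i=s}^{e} W_{i,j}$ and $P^*_{j,s,e} = \sum_{i=s}^{e} W^*_{i,j}$, so we have

$$\mathrm{SARE}(b, \hat{b}) = \sum_{v_j < \hat{b}} P_{j,s,e}(v_{j+1} - v_j) + \sum_{v_j \ge \hat{b}} P^*_{j,s,e}(v_{j+1} - v_j)$$

Observe that this matches the form of (8). As in Section 3.3, $P_{j,s,e}$ is monotone increasing in $j$, and $P^*_{j,s,e}$
is decreasing in $j$. Therefore, the same argument holds to show that there is a unique minimum value of
SARE, and it can be found by a ternary search over the range. Likewise, the form of the cost in equation (9)
shows that there are no interactions between different items, and so we can work in the (induced) value pdf
model. By building corresponding data structures based on tabulating prefix sums of the new $P$ and $P^*$
functions, we conclude:

Theorem 4. Optimal sum absolute relative error histograms can be computed over probabilistic data
presented in the tuple and value pdf models in time $O(n(|V| + Bn + n \log |V| \log n))$.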



3.5 Approximate Histogram Computation 

Our results so far all cost at least $\Omega(Bn^2)$ due to the use of dynamic programming to find the optimal bucket
boundaries. As observed in prior work, it is not always profitable to expend so much effort if the resulting
histogram is to be used to approximate the original input; clearly, if we tolerate approximation in this way,
then we should also be able to tolerate a histogram which achieves close to the optimal cost rather than
exactly the optimal. In particular, we should be happy if we can find a histogram whose cost is at most
$(1 + \epsilon)$ times the cost of the optimal histogram in time much faster than $\Omega(Bn^2)$.

Here, we can adopt the approach of Guha et al. [13, 14]. Instead of considering every possible bucket,
we can use properties of the error measure, and consider only a subset of possible buckets, much accelerating
the search. We observe that the following conditions hold for all the previously considered error measures:
(1) The error of a bucket only depends on the size of the bucket and the distributions of the items falling
within it; (2) The overall error is the sum of the errors across all buckets; (3) We can maintain information
so that given any bucket $b$ the best representative $\hat{b}$ and corresponding error can be computed efficiently; (4)
The error is monotone, so that the error for any interval of items is no less than the error of any contained
subinterval; and (5) The total error cost is bounded as a polynomial in the size of the input.

Note that most of our work so far has been in giving analysis and techniques to support point (3) above;
the remainder of the points are simple consequences of the definition of the error measure. As a result of
these properties, we invoke Theorem 6 of [14], and state

Theorem 5. Given preprocessing as described in previous sections, we find a $(1 + \epsilon)$ approximation to the
optimal histogram for SSE, SSRE, SAE and SARE, with $O(\frac{1}{\epsilon} B^2 n \log n)$ bucket cost evaluations using the
algorithm described in [14].



3.6 Maximum-Absolute-Error and Maximum-Absolute-Relative-Error

Thus far we have relied on the linearity of expectation and related properties such as summability of variance 
to simplify the expressions of cost and aid in the analysis. When we consider other error metrics, such as 
the maximum error and maximum relative error, we cannot immediately use such linearity properties, and 
so the task becomes more involved. Here, we provide results for maximum absolute error and maximum 
absolute relative error, MAE and MARE. Over a deterministic input, the maximum error in a bucket $b$ is
$\mathrm{MAE}(b, \hat{b}) = \max_{s \le i \le e} |g_i - \hat{b}|$, and the maximum relative error is $\mathrm{MARE}(b, \hat{b}) = \max_{s \le i \le e} \frac{|g_i - \hat{b}|}{\max(c, g_i)}$.
We focus on bounding the maximum value of the per-item expected error.¹ Here, we consider the frequency
of each item in the bucket in turn for the expectation, and then take the maximum over these costs. So we
can write the costs as

$$\mathrm{MAE}(b, \hat{b}) = \max_{s \le i \le e} \sum_{v_j \in V} \Pr[g_i = v_j] |v_j - \hat{b}|$$

$$\mathrm{MARE}(b, \hat{b}) = \max_{s \le i \le e} \sum_{v_j \in V} \frac{\Pr[g_i = v_j]}{\max(c, v_j)} |v_j - \hat{b}|$$

¹ Note that the alternate formulation, where we seek to minimize the expectation of the maximum error, is also plausible, and
worthy of further study.

We can represent these both as $\max_{s \le i \le e} \sum_{j=1}^{|V|} w_{i,j} |v_j - \hat{b}|$, where $w_{i,j}$ are non-negative weights
independent of $\hat{b}$. Now observe that we have the maximum over what can be thought of as parallel instances of
a sum-absolute-relative-error (SARE) problem, one for each $i$ value. Following the analysis in Section 3.4,
we observe that each function $f_i(\hat{b}) = \sum_{j=1}^{|V|} w_{i,j} |v_j - \hat{b}|$ has a single minimum value, and is increasing away
from its minimum. It follows that the upper envelope of these functions, given by $\max_{s \le i \le e} f_i(\hat{b})$, also has a
single minimum value, and is increasing as we move away from this minimum. So we can perform a ternary
search over the values of $v_j$ to find $j'$ such that the optimal $\hat{b}$ lies between $v_{j'}$ and $v_{j'+1}$. Each evaluation
for a chosen value of $\hat{b}$ can be completed in time $O(n_b)$: that is, $O(1)$ for each value of $i$, by creating the
appropriate prefix sums as discussed in Section 3.4 (it is possible to improve this cost by appropriate
precomputations, but this will not significantly alter the asymptotic cost of the whole operation). The ternary
search over the values in $V$ takes $O(\log |V|)$ evaluations, giving a total cost of $O(n_b \log |V|)$.
Knowing that $\hat{b}$ must lie in this range, the cost is of the form

$$\mathrm{MARE}(b, \hat{b}) = \max_{s \le i \le e} \alpha_i(\hat{b} - v_{j'}) + \beta_i(v_{j'+1} - \hat{b}) + \gamma_i$$

$$= \max_{s \le i \le e} \hat{b}(\alpha_i - \beta_i) + (\gamma_i + \beta_i v_{j'+1} - \alpha_i v_{j'})$$

where the $\alpha_i$, $\beta_i$, $\gamma_i$ values are determined solely by $j'$, the $w_{i,j}$s and the $v_j$s, and are independent of $\hat{b}$.
This means we must now minimize the maximum value of a set of univariate linear functions in the range
$v_{j'} \le \hat{b} \le v_{j'+1}$. A divide-and-conquer approach, based on recursively finding the intersection of convex
hulls of subsets of the linear functions, yields an $O(n_b \log n_b)$ time algorithm.² Combining these, we
determine that evaluating the optimal $\hat{b}$ and the corresponding cost for a given bucket takes time $O(n_b \log n_b|V|)$.
We can then apply the dynamic programming solution, since the principle of optimality holds over this error
objective. Because of the structure of the cost function, it suffices to move from the tuple pdf model to the
induced value pdf, and so we conclude:

Theorem 6. The optimal $B$ bucket histogram under maximum-absolute-error (MAE) and
maximum-absolute-relative-error (MARE) over probabilistic data in either the tuple or value pdf models can be found in time
$O(n^2(B + n \log n|V|))$.
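The shape of the per-bucket optimization for MAE and MARE can be illustrated numerically. The sketch below does not implement the two-stage method described above (discrete ternary search followed by a convex-hull step); it swaps in a plain ternary search over real-valued $\hat{b}$, which works because each weighted sum of absolute deviations is convex and so is their maximum. The function names and the weight-matrix encoding are assumptions for this example.

```python
def max_expected_abs_error(weights, V, b_hat):
    """max over items i of sum_j w_ij * |v_j - b_hat|.

    weights[i][j] is the non-negative weight of value V[j] for item i.
    """
    return max(sum(wij * abs(v - b_hat) for wij, v in zip(row, V))
               for row in weights)


def minimise_max_error(weights, V, iters=200):
    """Numerical ternary search for the b_hat minimising the upper envelope."""
    lo, hi = float(min(V)), float(max(V))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (max_expected_abs_error(weights, V, m1)
                <= max_expected_abs_error(weights, V, m2)):
            hi = m2  # a minimiser lies in [lo, m2]
        else:
            lo = m1  # a minimiser lies in [m1, hi]
    b_hat = (lo + hi) / 2.0
    return b_hat, max_expected_abs_error(weights, V, b_hat)


# Hypothetical bucket with two items: one surely 0, the other surely 4.
V = [0.0, 4.0]
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
b_star, cost_star = minimise_max_error(WEIGHTS, V)
```

In this toy input the optimum is $\hat{b} = 2$, which is not a member of $V$. This is why, unlike the cumulative measures, the maximum-error case needs the second step over linear functions between $v_{j'}$ and $v_{j'+1}$.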



4 Wavelets on Probabilistic Data 

We first present our results on the core problem of finding the B term optimal wavelet representation under 
sum-squared error, and then discuss extensions to other error objectives. 

² The same underlying optimization problem arises in a weighted histogram context [15], which gives full details of this
approach.

Figure 2: Results on Histogram Computation. Panels: (a) squared relative error with c = 0.5; (b) squared relative error with c = 1.0; (c) squared error; (f) absolute error.



4.1 SSE-Optimal Wavelet Synopses. 

Any input which defines a distribution over original $g_i$ values immediately implies a distribution over Haar
wavelet coefficients $c_i$. In particular, we have a possible-worlds distribution over Haar DWT coefficients,
with $c_i(W)$ denoting the instantiation of $c_i$ in world $W$ (defined by the $g_i(W)$s). Our goal is to pick a
set of $B$ coefficient indices $\mathcal{I}$ and corresponding coefficient values $\hat{c}_i$ for each $i \in \mathcal{I}$ so as to minimize
the expected SSE in the data approximation. By Parseval's theorem [18] and the linearity of the wavelet
transform, in each possible world, the SSE of the data approximation is simply the SSE in the approximation
of the normalized wavelet coefficients. Thus, by linearity of expectation, the expected SSE for the resulting
synopsis $S_w(\mathcal{I})$ is:

$$E_W[\mathrm{SSE}(S_w(\mathcal{I}))] = \sum_{i \in \mathcal{I}} E_W[(c_i - \hat{c}_i)^2] + \sum_{i \notin \mathcal{I}} E_W[(c_i)^2]$$

Some observations follow immediately. Suppose we are to include $i$ in our index set $\mathcal{I}$ of selected
coefficients. Then the optimal setting of $\hat{c}_i$ is the expected value of the $i$th (normalized) Haar wavelet coefficient,
by the same argument style as Fact 1. That is,

$$\mu_{c_i} = E_W[c_i] = \sum_{w_j} \Pr[c_i = w_j] \cdot w_j,$$

computed over the set of values taken on by the coefficients, $w_j$. Further, by linearity of expectation and the
fact that the Haar wavelet transform can be thought of as a linear operator $H$ applied to the input vector $A$,
we have

$$\mu_{c_i} = E_W[H_i(A)] = H_i(E_W[A]).$$

In other words, we can find the $\mu_{c_i}$s by computing the wavelet transform of the expected frequencies,
$E_W[g_i]$. So the $\mu_{c_i}$s can be computed with linear effort from the input, in either tuple pdf or value pdf form.
Based on the above observation, we can rewrite the expected SSE as:

$$E_W[\mathrm{SSE}(S_w(\mathcal{I}))] = \sum_{i \in \mathcal{I}} \sigma_{c_i}^2 + \sum_{i \notin \mathcal{I}} E[(c_i)^2],$$

where $\sigma_{c_i}^2 = \mathrm{Var}_W[c_i]$ is the variance of $c_i$. From the above expression, it is clear that the optimal strategy
is to pick the $B$ coefficients giving the largest reduction in the expected SSE (since there are no interactions
across coefficients); furthermore, the "benefit" of selecting coefficient $i$ is exactly $E[(c_i)^2] - \sigma_{c_i}^2 = \mu_{c_i}^2$. Thus,
the thresholding scheme that optimizes expected SSE is to simply select the $B$ Haar coefficients with the
largest (absolute) expected normalized value. (It is interesting to note that this scheme naturally generalizes
the conventional deterministic SSE thresholding case (Section 2.2).)



Theorem 7. With $O(n)$ time and space, we can compute the optimal SSE wavelet representation of
probabilistic data in the tuple and value pdf models.
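The thresholding scheme can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: it uses an orthonormal Haar transform (pairwise sums and differences scaled by $1/\sqrt{2}$) as the normalized transform, assumes the domain size is a power of two, and takes the input as a hypothetical list of per-item value pdfs. The coefficient ordering (overall average first, then coarse to fine details) is a convention of this sketch.

```python
import math


def haar_orthonormal(a):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    a = list(a)
    details = []
    while len(a) > 1:
        avg = [(a[i] + a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        det = [(a[i] - a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        details = det + details  # coarser levels end up in front
        a = avg
    return a + details


def expected_frequencies(pdfs):
    """E_W[g_i] for each item, from per-item value pdfs (value -> prob)."""
    return [sum(v * p for v, p in pdf.items()) for pdf in pdfs]


def sse_optimal_synopsis(pdfs, B):
    """Keep the B coefficients of H(E[A]) with largest absolute value."""
    mu = haar_orthonormal(expected_frequencies(pdfs))
    order = sorted(range(len(mu)), key=lambda i: abs(mu[i]), reverse=True)
    return {i: mu[i] for i in sorted(order[:B])}


# Hypothetical four-item input.
PDFS = [{0: 0.5, 2: 0.5}, {1: 1.0}, {0: 0.2, 3: 0.8}, {4: 1.0}]
synopsis = sse_optimal_synopsis(PDFS, 2)
```

Sorting all coefficients costs $O(n \log n)$ in this sketch; a linear-time selection routine would be needed to match the $O(n)$ bound of the theorem.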

4.2 Wavelet Synopses for non-SSE Error. 

The DP recurrence formulated over the Haar coefficient error tree for non-SSE error metrics in the
deterministic case (Section 2.2) extends naturally to the case of probabilistic data. The only change is that we
now define $\mathrm{OPTW}[j, b, v]$ to denote the expected optimal value for the error metric of interest under the
same conditions as the deterministic case. The recursive computation steps remain exactly the same. The
interesting point with the coefficient-tree DP recurrence is that almost all of the actual error computation
takes place at the leaf (i.e., data) nodes of the tree: the DP recurrences simply combine these computed
error values appropriately in a bottom-up fashion. For deterministic data, the error at a leaf node $i$ with an
incoming value of $v$ from its parents is just the point error metric of interest with $\hat{g}_i = v$; that is, for leaf
$i$, we simply compute $\mathrm{OPTW}[i, 0, v] = \mathrm{err}(g_i, v)$ which can be done trivially in $O(1)$ time (note that leaf
entries in $\mathrm{OPTW}[]$ are only defined for $b = 0$ since space is never allocated to leaves).

In the case of probabilistic data, such leaf-error computations are a little more complicated since we now
need to compute the expected point-error value

$$E_W[\mathrm{err}(g_i, v)] = \sum_{W \in \mathcal{W}} \Pr[W] \cdot \mathrm{err}(g_i(W), v)$$

over all possible worlds $W \in \mathcal{W}$. Fortunately, this computation can still be done in $O(1)$ time assuming
some simple precomputed data structures, similar to those we have derived for error objectives in
the histogram case. To illustrate the main ideas, consider the case of absolute relative error metrics, i.e.,
$\mathrm{err}(g_i, \hat{g}_i) = w(g_i) \cdot |g_i - \hat{g}_i|$ where $w(g_i) = 1/\max\{c, |g_i|\}$. Then, we can expand the expected error at
$g_i$ as follows:

$$\mathrm{OPTW}[i, 0, v] = E_W[w(g_i) \cdot |g_i - v|] = \sum_{v_j \in V} \Pr[g_i = v_j] w(v_j) \cdot \begin{cases} (v - v_j) & \text{if } v > v_j \\ (v_j - v) & \text{if } v \le v_j \end{cases}$$

where, as earlier, $V$ denotes the set of possible values for any frequency random variable $g_i$. In other words,
we have an instance of a sum-absolute-relative-error problem, since the form of this optimization matches
that in Section 3.4. So by precomputing appropriate arrays of size $O(|V|)$ for each $i$, we can search for the
optimal "split point" $v_{j'} \in V$ in time $O(\log |V|)$.
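A minimal sketch of such a per-leaf structure is given below, assuming the item's (induced) value pdf is available as a dict. The class name `LeafError` and its layout are hypothetical. Two prefix-sum arrays of size $O(|V|)$ are stored, and a binary search locates the split point between values at most $v$ and values above $v$, after which the expected error follows from a constant number of array lookups.

```python
from bisect import bisect_right
from itertools import accumulate


class LeafError:
    """Expected absolute relative error E_W[w(g_i) * |g_i - v|] for one item."""

    def __init__(self, pdf, c):
        # pdf maps each possible frequency value to its probability.
        self.values = sorted(pdf)
        wts = [pdf[x] / max(c, abs(x)) for x in self.values]
        # Prefix sums of the weights and of weight * value.
        self.W = [0.0] + list(accumulate(wts))
        self.WV = [0.0] + list(
            accumulate(wt * x for wt, x in zip(wts, self.values)))

    def expected_error(self, v):
        k = bisect_right(self.values, v)  # values[:k] are <= v
        below = v * self.W[k] - self.WV[k]
        above = (self.WV[-1] - self.WV[k]) - v * (self.W[-1] - self.W[k])
        return below + above


def expected_error_direct(pdf, c, v):
    """Same quantity by summing over every value, used only for checking."""
    return sum(p / max(c, abs(x)) * abs(x - v) for x, p in pdf.items())


# Hypothetical single-item pdf and sanity constant.
PDF = {0: 0.2, 3: 0.8}
leaf = LeafError(PDF, 0.5)
```

Each call to `expected_error` costs $O(\log |V|)$ for the binary search, which corresponds to the logarithmic factor mentioned in the text.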

The above precomputation ideas can naturally be extended to other error metrics as well, and allow us
to easily carry over the algorithms and results (modulo the small $O(\log |V|)$ factor above) for the restricted
case, where all coefficient values are fixed, e.g., to their expected values as required for expected SSE
minimization. The following theorem summarizes our discussion.

Theorem 8. Optimal restricted wavelet synopses for non-SSE error metrics can be computed over
probabilistic data presented in the (induced) value pdf model in time $O(n(|V| + n \log |V|))$.

For the unrestricted case, some additional work is needed in order to effectively bound and quantize the 
range of possible coefficient values to consider in the case of probabilistic data at the leaf nodes of the tree. 
One option is to consider pessimistic coefficient-range estimates (e.g., based on the minimum/maximum 
possible frequency values); another option would be to employ some tail bounds on the $g_i$s (e.g., using
Chernoff bounds, since tuples can be seen as binomial variables) in order to derive tighter, high-probability ranges
for coefficient values. We defer a more detailed exploration to the full version of this paper. 

5 Experiments 

We implemented our algorithms in C, and carried out a set of experiments to compare the quality and 
scalability of our results against those from naively applying methods designed for deterministic data. Ex- 
periments were performed on a desktop 2.4GHz machine with 2GB RAM. 

Data Sets. We experimented using a mixture of real and synthetic data sets. The real dataset came from the
MystiQ project,³ which includes approximately $m = 127{,}000$ tuples describing 27,700 distinct items. These
correspond to links between a movie database and an e-commerce inventory, so the tuples for each item
define the distribution of the number of expected matches. This uncertain data provides input in the basic
model, the items for various subsets of the relation. Synthetic data was generated using the MayBMS [1]
extension to the TPC-H generator.⁴ We used the lineitem-partkey relation, where the multiple possibilities
for each uncertain item are interpreted as tuples with uniform probability over the set of values in the tuple
pdf model.

Sampled Worlds and Expectation. We compare our methods to two naive methods of building a synopsis
for uncertain data using deterministic techniques discussed in Section 2.3. The first is to simply sample a
possible world, and compute the (optimal) synopsis for this deterministic sample. The second is to compute 
the expected frequency of each item, and build the synopsis of this deterministic input. This can be thought 
of as equivalent to sampling many possible worlds, combining and scaling the frequencies of these, and 
building the summary of the result. For consistency, we use the same code to compute the respective 
synopses over both probabilistic and certain data, since deterministic data can be interpreted as probabilistic 
data in the value pdf model with probability 1 of attaining a certain frequency. 

5.1 Histograms on Probabilistic Data 

We use our methods described in Section 3 to build the histogram over $n$ items using $B$ buckets, and compute
the cost of the histograms under the relevant metric (e.g. the expected sum-relative-error, etc.). Observe that,
unlike in the deterministic case, a histogram with $B = n$ buckets does not have zero error: we have to choose
a fixed representative $\hat{b}$ for each bucket, so any bucket with some uncertainty will have a contribution to the
expected cost. We therefore compute the percentage error of a given histogram as the fraction of the cost
difference between the one bucket histogram (largest achievable error) and the $n$ bucket histogram (smallest
achievable error).

³ http://www.cs.washington.edu/homes/suciu/project-mystiq.html
⁴ www.cs.cornell.edu/database/maybms/

Quality. For uniformity, we show results for the MystiQ movie data set; results on synthetic data were
similar, and are omitted for space reasons. The quality of the different methods on the same $n = 10{,}000$
distinct data items is shown in Figure 2. In each case, we measured the cost for using up to 1000 buckets over
the four cumulative error measures: sum-squared-error, sum-squared-relative-error, sum-absolute-error and
sum-relative-error. We show two values of the sanity constant for relative error, $c = 0.5$ and $c = 1.0$. Since
our results show that the dynamic programming finds the optimum set of buckets, there is no surprise that
the cost is always smaller than the two naive methods. Figure 2(a) shows a typical case for relative error: the
probabilistic method is appreciably better than using the expected costs, which in turn is somewhat better
than building the histogram of a sampled world. We show the results for three independent samples to show
that there is fairly little variation in the cost. For the sum-squared-error and sum-absolute-error (Figures 2(c)
and 2(f)), while using a sampled world is still poor, the cost of using the expectation is very close to that
of our probabilistic method. The reason is that the histogram obtains the most benefit by putting items
with similar behavior in the same bucket, and on this data, the expectation is a good indicator of behavioral
similarity. This is not always the case, and indeed, Figure 2(f) shows that while our method obtains the
smallest possible error with about 600 buckets, using the expectation finds a bucketing with a slightly higher
cost. Other values of $c$ tend to vary smoothly between two extremes: increasing $c$ allows the expectation
method to get closer to the probabilistic solution (as in Figure 2(e)). This is because as $c$ approaches the
maximum achievable frequency of any item, there is no longer any dependency on the frequency, only on
$c$, and so the cost function is essentially a scaled version of the squared error or absolute error respectively.
Reducing $c$ towards 0 further disadvantages the expectation method, and it has close to 100% error even
when a very large number of buckets are provided; meanwhile, the probabilistic method smoothly reduces
in error down to zero.

Scalability. Figure 3 shows the time cost of our methods. We show the results for sum squared error,
although the results are very similar for other metrics, due to a shared code base. We see a strong linear
dependency on the number of buckets, $B$, and a close to quadratic dependency on $n$ (since as $n$ doubles, the
time cost slightly less than quadruples). This confirms our analysis that shows the cost is dominated by an
$O(Bn^2)$ term. The time to apply the naive methods is almost identical, since they both ultimately rely on
solving a dynamic program of the same size, which dwarfs any reduced linear preprocessing cost. Therefore,
the cost is essentially the same as for deterministic data. The time cost is acceptable, but it suggests that for
larger relations it will be advantageous to pursue the faster approximate solutions outlined in Section 3.5.



Figure 3: Histogram Timing Costs 

5.2 Wavelets on Probabilistic Data 

We implemented our methods for computing wavelets under the sum-squared-error (SSE) objective. Here, 
the analysis shows that the optimal solution is to compute the wavelet representation of the expected data 
(since this is equivalent to generating the expected values of the coefficients, due to linearity of the wavelet 
transform function), and then pick the B largest coefficients. We contrast this with the effect of sampling possible 
worlds and picking the coefficients corresponding to the largest coefficients of the sampled data. We measure 
the error by computing the sum of the squares of the expected coefficients $\mu_{c_i}$ not picked by the method, and 
expressing this as a percentage of the sum of the squares of all such $\mu_{c_i}$, since our analysis demonstrates that 
this is the range of possible error (Section 4.1). Figures 4(a) and 4(b) show the effect of varying the number of 
coefficients B on the real and synthetic data sets: while increasing the number of coefficients improves the cost 
in the sampled case, it is much more expensive than the optimal solution. Both approaches take the same amount 
of time, since they rely on computing a standard Haar wavelet transform of certain deterministic data: it takes linear 
time to produce the expected values and then compute the coefficients; these are sorted to find the B largest. 
This took much less than a second on our experimental setup. 
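
The SSE procedure described above (take expectations, apply a Haar transform, keep the B largest coefficients, and report the dropped energy as a percentage) can be sketched in a few lines of Python. The representation of each item's distribution as a list of (frequency, probability) pairs, the orthonormal scaling of the transform, and all function names are assumptions made for illustration rather than the implementation used in the experiments.

```python
import math


def expected_values(pdfs):
    """Expected frequency of each item from (frequency, probability) pairs."""
    return [sum(f * p for f, p in pdf) for pdf in pdfs]


def haar_transform(data):
    """Orthonormal Haar transform; len(data) must be a power of two."""
    current = list(data)
    details_all = []
    root2 = math.sqrt(2.0)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / root2 for i in pairs]
        details = [(current[i] - current[i + 1]) / root2 for i in pairs]
        # Coarser levels go in front of finer ones.
        details_all = details + details_all
        current = averages
    return current + details_all


def dropped_energy_percentage(expected_data, b):
    """Percentage of squared coefficient mass lost when keeping the b largest."""
    energies = sorted((c * c for c in haar_transform(expected_data)), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0.0
    return 100.0 * sum(energies[b:]) / total


# Hypothetical input: two uncertain items and two certain ones.
pdfs = [[(0, 0.5), (8, 0.5)], [(4, 1.0)], [(2, 0.25), (6, 0.75)], [(1, 1.0)]]
expected = expected_values(pdfs)
error_with_two = dropped_energy_percentage(expected, 2)
```

Because the transform in this sketch is orthonormal, the squared coefficients sum to the squared norm of the expected data, so the percentage falls from 100 with no coefficients retained to 0 when all are retained.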



6 Concluding Remarks 

We have introduced the probabilistic data reduction problem, and given results for a variety of cumulative 
and maximum error objectives, for both histograms and wavelets. Empirically, we see that the optimal 
synopses can accurately summarize the data and are significantly better than simple heuristics. It remains to 
better understand how to approximately represent and process probabilistic data. So far, we have focused on 
the foundational one-dimensional problem, but it is also important to study multi-dimensional generalizations. 
The error objective formulations we have analyzed implicitly assume uniform workloads for point queries, 
and so it remains to address the case when in addition to a distribution over the input data, there is also a 
distribution over the queries to be answered. 

Acknowledgements. We thank Sudipto Guha for some useful suggestions, Dan Suciu for providing data 
from the MystiQ project, and Dan Olteanu for generating data with the MayBMS system. 